CoDoSA: A Lightweight, XML-Based Framework for Integrating Unstructured Textual Information

Authors

  • John R. Talburt
  • Eric D. Nelson
Abstract

One of the most fundamental dimensions of information quality is access. For many organizations, a large part of their information assets is locked away in Unstructured Textual Information (UTI) in the form of email, letters, contracts, call notes, and spreadsheets. In addition to internal UTI, there is a wealth of publicly available UTI on websites, in newspapers, in courthouse records, and in other sources that can add value when combined with internally managed information. This paper describes a system called Compressed Document Set Architecture (CoDoSA), designed to facilitate the integration of UTI into a structured database environment where it can be more readily accessed and manipulated. The CoDoSA Framework comprises an XML-based metadata standard and an associated Application Program Interface (API). The paper further describes how CoDoSA can facilitate the storage and management of information during the ETL (Extract, Transform, and Load) process used to integrate UTI. It also explains how CoDoSA promotes higher information quality by providing several features that simplify the governance of metadata standards and the enforcement of data quality constraints across different UTI applications and development teams. In addition, CoDoSA provides a mechanism for inserting semantic tags into captured UTI; these tags can be used in later steps to drive semantic-mediated queries and processes.

Similar resources

Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML

As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...

TEXTUAL AND INTER-TEXTUAL ANALYSES OF IRANIAN EFL UNDERGRADUATES’ TYPES OF ENGLISH READING TOWARDS DEVELOPING A CAREFUL READING FRAMEWORK

This study investigated textual and inter-textual reading of a group of Iranian EFL undergraduates’ careful English reading types. In this research, Khalifa and Weir’s (2009) reading framework was used to propose a more inclusive aspect of a careful reading framework and the reading construct for instructional and assessment goals. The participants of this study were B.A. students of English Tr...

A framework for XML similarity joins

A prime motivation for using XML to directly represent pieces of information is the ability of supporting ad-hoc or “schema-later” settings. In such scenarios, modeling data under loose data constraints is essential. Of course, the flexibility of XML comes at a price: the absence of a rigid, regular, and homogeneous structure makes many aspects of data management more challenging. Such malleabl...

Shallow, Deep and Hybrid Processing with UIMA and Heart of Gold

The Unstructured Information Management Architecture (UIMA) is a generic platform for processing text and other unstructured, human-generated data. For text, it has been proposed and is being used mainly for shallow natural language processing (NLP) tasks such as part-of-speech tagging, chunking, named entity recognition and shallow parsing. However, it is commonly accepted that getting interes...

Publication date: 2009